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Abstract 

For optimization problems associated with engineering design, parameter estima- 
tion, image reconstruction, and other optimization/simulation applications, low accu- 
racy function and gradient values are frequently much less expensive to obtain than 
high accuracy values. We investigate the computational performance of trust region 
methods for nonlinear optimization when high accuracy evaluations are unavailable or 
prohibitively expensive, and confirm earlier theoretical predictions than the algorithm 
is convergent even with relative gradient errors of 0.5 or more. The proper choice 
of the amount of accuracy to use in function and gradient evaluations can result in 
orders-of-magnitude savings in computational cost. 


*This research was supported by the National Aeronautics and Space Administration under NASA Con- 
tract No. NAS 1-18605 while the author was in residence at the Institute for Computer Applications in 
Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665. 
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1 Introduction 


Consider the nonlinear optimization problem 


minimize /(x) (1) 

x 6 5? n 

where the function / has gradient V/, with V/ assumed to be Lipschitz continuous. We are 
concerned with numerically solving this problem when function and gradient values are not 
known exactly. 

Problems of this nature frequently occur in engineering design, parameter estimation, 
and many other situations. Consider, for example, the design of a heat sink for transferring 
excess heat away from an electronic component. Given the geometry of the sink (expressed, 
perhaps, as the spacing, thickness, and length of each cooling fin) and the heat flux from 
the component, one can mathematically model the temperature distribution in the sink and 
the surrounding medium by a system of partial differential equations. If we wish to find 
the design geometry which minimizes the time-averaged temperature of the component, we 
must numerically solve this system of PDE’s at each iteration of the optimization algorithm 
to determine a value for the objective function f . Furthermore, a value for the gradient of / 
must be computed at each iteration either through successively perturbing each component of 
x and recomputing / to obtain a finite difference approximation, or through directly solving 
a larger system of PDEs. Clearly, exact function and gradient values are not attainable, and 
the computational expense of any approximation at a given iteration increases very rapidly 
as the required accuracy is increased. Let h be the discretization mesh size selected and 
m<j be the number of spatial dimensions in the PDE, so that the number of elements in the 
discretization is given by k = o((l/h) md ). If the order of the solution method is given by 
error = o(h m ° ), and the computational expense for the linear algebra associated with the 
problem is CPU = o(k mi ), we have 
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CPU/iteration = o [error 1 ) (2) 

for l = • Although this estimate is admittedly crude, it seems to hold for many 

applications and indicates that computational expense per iteration can rise extremely rapidly 
as more accurate solutions are required. In our example, if a full three-dimensional model of 
the sink is being used with a direct linear algebra solver and an o[h 2 3 ) solution method, we 
would have rrid = 3 , mi = 3 , m 0 — 2; hence l would be 3 and a thousand-fold increase in 
computational time would be needed to increase the accuracy of any given approximation by 
one digit. Problems involving systems of ODEs tend to be more benign with much smaller 
values of to ~), but computational expense still increases geometrically with accuracy. 

Since low accuracy function and gradient evaluations can be orders of magnitude less 
expensive than high accuracy evaluations, it behooves us to consider optimization algorithms 
that do not require the maximum possible accuracy at each iteration. Trust region methods 
are a natural candidate for investigation because of their reputation for robustness and 
efficiency. A number of authors have established global convergence results for trust region 
methods using inexact gradients ([12], [3], [IT]), and inexact function evaluations have also 
been treated [4]. In this paper, we investigate the numerical behavior of the algorithms 
presented in [3] and [4] in order to answer the following questions. 

1. How much error can we allow in our evaluations before the algorithm fails? Does this 
level agree with theoretical predictions? 

2. The performance of the algorithm will certainly decrease when less accurate evaluations 
are used. How fast does performance degrade and how problem-dependent is the rate 
of degradation? 

3. How does this lessened performance balance with the greatly decreased computational 
cost associated with less accurate evaluations? 
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4. How well do the techniques suggested in [3] for estimating and controlling gradient 
error work in practice? 

The remainder of this paper is organized as follows. 

In Section 2, we present the trust region algorithm and review the conditions on permis- 
sible levels of error established in [3]. These conditions depend on some of the parameters of 
the trust region method but are remarkably relaxed: typically relative errors in the gradient 
of 0.5 or more are permissible. In Section 3, we examine the performance of the algorithm 
on the set of standard test problems from More, Garbow, and Hillstrom [13] when synthet- 
ically generated errors are added to the gradients at each iteration. Our results confirm 
the theoretical predictions for the algorithm, and we note that the number of iterations 
required by the algorithm tends to increase exponentially with the relative error induced 
in the gradient. When balanced against (2), however, we find that allowing low accuracy 
evaluations is still attractive. In Section 4, we examine the performance of the algorithm 
on a parameter identification problem found in the literature in order to confirm our results 
without resorting to synthetically induced errors or invoking (2). We also verify a technique 
for estimating and controlling error when the gradient is approximated by finite differences. 
Section 5 summarizes our results. 

2 The trust region algorithm and permissible error 
in function and gradient evaluations 

The trust region method for solving (1) generates a sequence of iterates-fx*} by approxi- 
mately solving a sequence of constrained quadratic model problems. Each local quadratic 
model is of the form 


'Mzfc + s) 


fk + gls + -s T B k s 


( 3 ) 
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where f k is an approximation to /(sc*), g k is an approximation to the gradient V/(sc*), and 
the symmetric matrix B k £ 3J nXn is an approximation to the Hessian matrix V 2 /(x*). At 
each iteration, we take z*+i = x k + s* , where s* is an approximate solution to the trust 
region subproblem 

minimize ^k{%k + *) s/t ||jD*s|| < A*. (4) 

s e$ n 


The positive scalar A* is known as the trust radius , and the nonsingular matrix D k £ 3 ? n 
is the scaling or preconditioning matrix (often taken to be a fixed diagonal matrix). At each 
iteration, A* is adjusted so that the ball < A* represents the region over which we 

expect (3) to adequately model the function /. 

A number of techniques are available for computing an approximate solution to (4). 
An excellent survey of the main classes of these methods can be found in [12]. In our 
computations, we chose to use an optimal locally constrained , or OLC technique [6]. Similarly, 
a number of techniques can be used to generate the Hessian approximation {£?*}, but we 
selected the popular BFGS secant update 


R d I (gfc+1 - 9k) Cgfc+i - 9kf_ _ BkfksZj^ 

k+1 ~ k (gk+i - gk) T s k s T k B kSk 


( 5 ) 


when (g k +i — g k ) T $k > 10 6 ( 5 *+i — gk) T (gk+ 1 — gk), and B k+1 = B k otherwise. Since g k is 
only an approximation to V/(x*), [4] and [3] suggest that upper bounds of the form 


\\B k \\ < Cx 
or 


( 6 ) 


glB k g k < c x glg k (7) 

be directly enforced for some appropriately large cj. This could be easily done by using the 
replacement operation 
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B k := min{l, ( 8 ) 

at each iteration. Although bounds such as (6) and (7) are needed to establish convergence 
results, in practice we found (8) unnecessary. 

A simple version demonstrating the salient features of the trust region algorithm is as 
follows. 

Algorithm (1): The trust region method. 

Let the constants 0 < r}\ < rj 2 < rj 3 < 1 be prespecified. Select an initial guess i 0 E and 
an initial trust radius A 0 . Compute / 0 and g 0 , and compute or initialize B 0 . For k = 0, 1, . . . 
until “convergence” do: 

(a) Determine an approximate solution s k to problem (4). 

(b) Calculate the predicted function reduction. 


Fredas*) = -gls k - -slB k s k 


and the computed function reduction 


( 9 ) 


credos*.) = f k - /jt+i- 

If necessary, recompute f k+ i and/or f k to greater accuracy, 

(c) Compute the ratio 


( 10 ) 


_ credfc(sfc) 

Pk P re djt(-sjt) ' 

(d) If p k < r)i, then the step is unacceptable. Set A* := ^jA* and return to (a). 


( 11 ) 
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(e) If 771 < p k < V2, then set A* +1 = jA fc , 


Else if 773 < Pk < (2 — 773), then set A* + i = 2 A*, 

Else set A^ +1 = A*. 

(f) Set xjt + i = x k + -s*:, compute g k +i, and compute or update B k +i- 

End loop 
End algorithm 

Typical values for the step acceptance/ trust radius update parameters are 771 = 0.001, t ) 2 = 
0 . 1 , and 773 = 0 . 75 . Notice that no step is accepted unless p k > rjx, and that the trust radius 
is never reduced unless p k < 772. Further notice that g k is only computed once per major 
iteration. 

Two conditions are required of the approximate function values. Define 



ared*(sjt) = f(x k ) - f(x k + s k ). 

(12) 

We then require 




|ared fc (s fc ) - cred*(s*)| > C/,i pred fc (s fc ) 

( 13 ) 

and 




|aredfc(s*) - cred*(s*)| < C /,2 |cred*(s*)| (14) 

for some constants (f t i and ( y t2 • A stronger variation of these conditions that is typically 
more practical is 


|/jfe+i - f( x k+i)\ + \fk - /(z*)| < C /,1 pred Jt (s Jfc ) (15) 


fk+ 1 - f( x k+i)\ + I fk ~ fM\ < C /,2 |credjt(s fc )| (16) 
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Assuming error estimates are available for | /* — /(x*.)|, [3] suggests that (15) and (16) be 
enforced by using the following procedure in place of step (b) of Algorithm (1). 


Procedure 2 

Let a 6 (0, 1) be prespecified. Given x&, and an estimate for | fk — f(xk)\> do the 

following. 

(i) Calculate pred Jt (s Jt ) and emax = ^xpred^s*). 

(ii) If necessary, recompute /*, to greater accuracy so that |/* — /(xfc)| < a emax. 

(iii) Compute f k +i so that \fk+i — /(®Jt+i)| < (1 — a)emax. 

(iv) Compute credos*). If (16) is satisfied, then exit procedure, else reduce emax and 
return to (ii). 


End procedure 

The permissible level of error in the gradient evaluations can be characterized in two 
different ways. The preferable condition is 


for some constant ( g , with 


<t 


(17) 


ek = 9k~ V/(x fc ). (18) 

Under appropriate assumptions on / and {-Djt}, equation (17) leads to the strong global 
convergence result lim*^ \\g k \\ = lim*^*, ||V/(x fc )|| = 0. The weaker convergence result 
liminfjfc_oo ||pfc|| = 0 can be obtained using the condition 

We k f{D^g k )^ (1Q) 
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If each scaling matrix is taken to be the identity, (17) and (19) become simply 


< (, ( 20 ) 

and 


Finally, we must specify the allowed 
values are given by the inequalities 

Cg + 0 ,i <1-172 ( 22 ) 

and 


e k9k 

9k 9 k 


<<,■ 


( 21 ) 


values for the error bounds C/,i,C/,2 and 0,3- These 



0 < 0,2 < 1, (23) 

with 0 — 0 an d 0.1 — 0 - These limits are remarkably generous. If tj 2 (the parameter 
controlling trust radius reduction) is 0.1, we could select 0,i = 0.05,0,2 = 0.99, and ( g = 0.8 
— less than one significant bit of accuracy in the components of the gradient approximation. 

3 Algorithm Performance on More-Garbow-Hillstrom 
Test Problems With Synthetically Induced Errors 

The More-Garbow-Hillstrom test set [13] contains eighteen“typical” optimization problems. 
These problems are all algebraically defined, and thus function and gradient values are 
both inexpensive to compute and available to high accuracy. In order to test the effects 
of gradient error, we synthetically induced a random error into each gradient computed. 
More specifically, we computed a vector w k with each component selected from a uniform 
distribution on [-1,1] and then set g k — V/(x*) + e k with 


e k = 100 *w k * ]| V/(x fc )||/2 m , 


(24) 
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where m is the smallest positive integer for which (20) is satisfied. For our first set of tests 
no errors were induced in the computation of function values. 

i 

!>ur optimization code, we used an implementation of the Dennis-Schnabel routines 
[ an OLC step computation technique. The model Hessians Bk were computed using 
.dard BFGS procedure (5). We emphasize that no special techniques were used to 

[for gradient error when computing i?*, nor were safeguards such as (8) enforced. 

i 

Ue for r ]2 in the optimization code was 0.1, so that the theory in [4] and [3] suggests 
r limit of 0.9 for ( g . 

sach problem in the test set, we considered twenty different values of ( g ranging 
i to 0.95. For each value of ( gy we ran the algorithm with a number 1 of different 
itions for the random number generator used to compute e* values, and tabulated 
jmum, median, and maximum number of iterations required to converge to a local 
jbr. In all, over 5000 test cases were run on a network of SUN 3/50 workstations in 
|AN 77 double precision. 

^es (1) through (5) show plots of our results for selected problems. Let K(( g ) denote 
[ber of iterations required for convergence as a function of relative gradient error, 
tical axis in each plot represents the natural log of if(Cs) > while the three traces in 
t represent the observed minimum, median, and maximum values, 
res (1) through (3) are typical of most of our test cases, in which performance 

j exponentially: 


K{( g ) « K(0) exp (b( g ) t (25) 

with observed decay coefficients b ranging from roughly 2 to 6 for the various problems. 
Figures (4) and (5) represent anomalous cases where ln(if(C$)) has significant variation from 
linearity at small and large values of Cg- 

Although the exponential performance degradation given by (25) is a telling argument 

1 Depending on the computational expense associated with a problem and the observed variability of 
results, between 5 and 100 test cases were run for each value of ( g . 
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ln(K(C,)) 









Figure 3: Number of iterations versus for the Extended Powell Singular Test Function. 



Figure 4: Number of iterations versus ( t for the Gaussian Test Function. 
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Figure 5: Number of iterations versus ( g for the Trigonometric Test Function. 

against using low accuracy gradients if high accuracy evaluations are obtainable without 
greatly increased computational expense, low accuracy evaluations are still attractive in 
cases where the computational expense increases rapidly with increasing accuracy. Suppose, 
for instance, that 


CPU/iteration ss 1 (26) 

for some constants c a and l. The total computational expense for solving a given problem 
will then be proportional to Figures (6) through (10) show the predicted total 

computational cost of the median curves for K(( g ) in figures (1) through (5) for the values 
^ = a> an< l 2 (with each curve normalized so that the minimum value is 1.0). 

Notice that each curve of total computational cost increases very rapidly as (- — ► 0 or 
C o 1> has a relatively large, flat minima. This behavior holds for both the “typical” 
cases (figures (6) through (8)) and the anomalous cases (figures (9) and (10)). Interestingly, 
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Figure 6: Computational coat profiles for the Watson Test Function. 



Relative error In gradient « # ) 

Figure 7: Computational cost profiles for the Brown and Dennis Test Function 
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Figure 10: Computational cost profiles for the Trigonometric Test Function, 
combining the idealizations (25) and (26) into 


CPU m A'(0)c a C- , exp(K,) 
yields the theoretical "best” value 


(27) 


/* — I 

Sfl I * 


(28) 


Using the observed “typical” value 6=4 yields the rule of thumb choice 


c; = j. < 29 ) 

which worked quite well for all of our test problems. 

A number of variations on these numerical tests were also tried. Rather than (26) we 
also considered the idealization 
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CPU/iteration » c 2 ( g 1 + c 3 , (30) 

where c 3 represents a fixed “overhead” cost per iteration. For moderate values of c 3) the 
character of the overall computational cost behavior remained unchanged. Errors in function 
values were also considered with ( g fixed at 0.1. The algorithm proved to be quite insensitive 
to these errors for (f tl < 0.5. Indeed, the algorithm works in practice even if > 1 provided 
the average value of |ared fc (s fc ) — credos*.) | / pred^s*) is sufficiently less than 1 — 772 — ( g . 
As pointed out in [4], this is a very reasonable result since function values are only used to 
update the trust radii, and a mistake at any given iteration will not cause the algorithm to 
fail. 


4 Algorithm performance on a parameter identifica- 
tion problem 

The test problems of the last section are widely recognized as representing “typical ” op- 
timization problems, and because they were algebraically defined, we were able to run an 
enormous number of test cases to examine the ranges of possible behavior in the presence of 
errors. It should be remembered, however, that our interpretation of these results rests both 
on the character of the synthetic noise added to each gradient evaluation and on idealization 
(26). In order to verify our results, we also tested our algorithm on the following parameter 
identification problem from [ 10 ] and [ 11 ]. 

Consider the accidental release of the radioactive gas tritium into an enclosure surround- 
ing a nuclear reactor. The tritium will react with water vapor in the containment to produce 
other tritium-based species via 


T 2 + H 2 0 ^ HT + HTO, (31) 

and some of the HTO may be adsorped into the surface of the containment. This adsorped 
tritium species represents a significant clean-up problem. 
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Given the reaction rate constant in (31) and the adsorption and release rates of HTO on 
the surfaces of the containment, the physical problem can be modeled by a system of four 
coupled initial value ODEs: 


Y\t\ x) = h(t, Y{t ; x)), Y( 0 ; x) = y 0) (32) 

with the components Y : 5ft x 5ft 3 — > 5ft 4 being species concentrations. Unfortunately, the 
rate constants are not directly measurable. Maroni et al performed an experiment in which 
a known amount of tritium was introduced into a small enclosure and the total tritium 
concentration Y\ + Y 2 + Y 3 was measured at m discrete time points. The rate constants 
Xi 7 x 2} and z 3 can then be estimated by minimizing / : 5ft 4 — » 5ft with 


1 


y £Y j (t i ,x)-O i + x 4 
_i=i 

where Oi is the observed experimental concentration 


/ 0) = oE 

z ;=i 


/m Oi 2 , 


(33) 


O i = Y 1 (t i ) + Y 2 (t i ) + Y 3 (t i ), (34) 

and x 4 is an additional variable representing an unknown experimental bias in the instrument 
for measuring (34). 

Equation (33) is a classical inverse problem. Note that each function evaluation for a 
given iterate Xk involves the numerical solution of four coupled ODEs. Gradient values 
can be computed via finite differences, or by the numerical solution of a system of sixteen 
coupled ODEs. Although the latter technique is usually preferable in practice, we used the 
more difficult approach of estimating g k by finite differences so that we could investigate a 
techniques suggested in [3] for estimating and controlling gradient error. 

Our numerical experiments with (33) were designed as follows. In order to approximate 
/ at a given point x ky (32) was solved using ODEPACK [7], which uses an adaptive solution 
technique. An important feature of ODEPACK is that it allows a desired level of accuracy 
(either absolute or relative) to be prespecified for each component of Y. In order to achieve 
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an accuracy of, say, e*. in /(x*), we specified a desired accuracy of in each component 

of Y, where cik-i was the amount of accuracy lost due to cancellation in evaluating (33) at 
the last iteration. This simple procedure worked remarkably well: the actual error in /*. was 
typically in the range [l/lOe*, 2e*] in our preliminary tests of this technique. 

Each gradient was initially approximated by a central difference formula using 2n extra 
function evaluations, where each function evaluation was computed with a specified desired 
relative accuracy of £*. Denote this approximation ~g k . We then computed a more accurate 
estimate of the directional derivative of / in the direction g k (or (DjD*.) -1 ^ if the scaling 
matrix is not the identity) as suggested in [3], using the formula 


^ it = 2 £^(/( x * + ^ k9k ) — f( Xk ~ & k 9k))- (35) 

Each function evaluation in (35) was computed with a specified desired relative accuracy of 
ejfc/10. The perturbation length Sk was taken to be 


Sk = 



(36) 


a value expected to perturb two-thirds of the accurate digits of /. Using (18) and (35), we 
can then estimate the error term in (21) via 


9k9 k 


= 1 


V/(**) T & 


9k9 k 


1 - 


I \9k\\ 2 ' 


(37) 


If this error was significantly larger than the desired gradient error level ( g , then £* was de- 
creased before the next iteration; if it was significantly smaller e* was immediately decreased 
and ~g k was recomputed ( in practice this seldom occurred except at the first gradient com- 
putation . ) Figure (11) shows the agreement between actual and requested gradient error. 
Even with the simple procedures used to adjust £*., the error estimate given by (35) and (37) 
allows us to control, with reasonable certainty, the level of accuracy in g k . 

The approximation g k can be further improved at no additional cost by setting 
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1111 I 1 J- — 1 

icr 8 icr 6 io ~ 4 io - 2 

Requested Gradient Accuracy 


Figure 11: Actual versus Commanded error in the gradient for the Tritium Parameter Esti- 
mation Problem. 


Sk " hj*V ? ‘ (38) 

so that 

e k£k _ l _ V/(sfc) r & (39) 

9 k 9k 3 fc 

Figure (12) shows the CPU time required to achieve a given level of accuracy using (38) in 

addition to the previously discussed procedure for adjusting T*. We see that computational 

expense increases geometrically as accuracy increases. 3 

Given these methods of evaluating /* and gk to some specified accuracy, we recorded the 

3 The idealised cost profile (26) yields a very close fit to this plot if Z is taken to be fo. However, tests 
with different values of z* showed that better empirical form this problem is 

CPU/gradient evaluation w 

Nonetheless, the rule-of-thumb choice (29) with l = & proved to be close to the optimal selection in our 
numerical tests. 
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3.00 



Figure 12: Solution time required for one gradient evaluation in Tritium parameter identifi- 
cation problem. 

computational time required to solve (33 ) for a number of different values of ( g , and for the 
following 3 cases. 

1. Each fk value was computed to high relative accuracy (10~ 8 ), and each g k value was 
computed as previously described including the correction (38). 

2. Each was computed to high relative accuracy (10 -8 ), and each g % was computed as 
described previously but without doing correction (38). 

3. Each f k was computed using Procedure 2 with £/,i = 0.1 and (f >2 — 0.99, and each g k 
value was computed as previously described including correction (38). 

A somewhat different implementation of the trust region method was used rather than the 
Dennis-Schnabel code used in the last section. First, we included nonnegativity constraints 
on the first three components of x to be consistent with the physics of the problem. This 


20 



Requested Gradient Accuracy at Each Iteration 

Figure 13: Solution time for numerical optimization of the Tritium parameter identification 
problem. 

was done by replacing the ellipsoidal trust region in (4) with the rectangular trust region 
||£ > fc'»||oo < A k, and by using a quadratic programming code to exactly solve the trust 
region subproblem subject to the nonnegativity constraints on x. Second, we used the 
Hessian safeguarding techniques proposed in [2] in addition to the standard BFGS update 
(5). Third, we used a simplified stopping criteria for comparison purposes. Each test case 
was terminated when /* — /*< k>o(/o — /*)> where /* was the optimal value of /. 

Figure 13 shows the results of our tests. 

Case 1 was tested for 15 different values of ( g . Note that the total computational time 
required is less than 8000 seconds for ( g = 0.15, but rises to almost 16000 and 24000 seconds 
for ( t = 1.5 x 10-« and (, g = 0.25, respectively. Fewer data were collected for cases 2 and 
3, but note that the correction (38) appears to make little difference to the algorithm when 

e k9k/gk9k i» small. On the other hand, using Procedure 2 rather than computing each /* 
to a fixed accuracy of 10 -8 resulted in a moderately faster algorithm. 
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In addition to the above 3 cases, a number of other numerical tests were made. Rather 
than keeping £ g fixed throughout the algorithm, we tried setting 

,*+■ _ / 1/10 

\ ’"“{Cj/2, 10 '} 

or conversely 

= 1 10-5 t 
^ [ mm{ 2Cg , 1/10} 

The idea behind (40) is to try to obtain the fast local convergence properties of the BFGS 
method when accurate gradients are available, while the idea behind (41) is to avoid highly 
accurate gradient approximations near the solution where they are likely to be most expen- 
sive. Interestingly, both of these approaches performed similarly, requiring 8887 and 10647 
seconds, respectively. 

Although the correction (38) appears to be of little use when e^g k / 9 k 9k is already small, 
it does appear to be useful in preventing the algorithm from failing due to occasionally en- 
countering highly inaccurate gradient evaluations. In tests where a large synthetic error was 
added to the gradient approximation every p iterations, the algorithm was much more robust 
when (38) was used (although the algorithm did still have problems involving convergence 
to a point with g k = 0 and V/( x k ) / 0, as predicted in [4].) In a similar vein, the gradient 
accuracy test given by (35) and (37) should be useful in verifying the accuracy of analytically 
derived gradients. 


if k = 0 

otherwise, 


if k = 0 

otherwise. 


(40) 


(41) 


5 Summary 

We have examined the numerical behavior of trust region algorithms for nonlinear optimiza- 
tion when function and gradient values are not computed exactly. This class of algorithms 
has proven remarkably robust, and can be successfully implemented even with very large 
errors in the function and gradient evaluations. 
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In a large number of tests using standard test problems with synthetically induced gra- 
dient errors, we observed that the algorithm performance, as measured by the number of 
iterations required for convergence, tends to degrade exponentially as the relative gradient 
error increases. This is a telling argument for using accurate evaluations provided they can 
be obtained at reasonable expense. For many optimization/simulation problems, however, 
the computational expense of these evaluations rises sharply with increasing accuracy, and 
low accuracy evaluations are again attractive. A good choice for the amount of relative 
gradient error allowed in the algorithm can result in orders-of-magnitude savings in compu- 
tational cost. If (26) holds, then the choice ( g = lj\ was nearly optimal for all of our test 
problems . 

Using a parameter estimation problem based on the numerical solution of a system of 
ODEs, we tested a technique for estimating and controlling the amount of error in a gradient 
approximation. This technique was very successful when used in conjunction with the “user 
specified accuracy” feature in the numerical differential equation solver ODEPACK. Actual 
computational costs for various values of relative gradient error were examined to confirm 
the behavior observed in the test problems with synthetically induced errors. 
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Figure 3: Number of iterations versus ( g for the Extended Powell Singular Test Function. 



Relative error in gradient « p ) 

Figure 4: Number of iterations versus ( g for the Gaussian Test Function. 
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Relative error in gradient (C y ) 


Figure 5: Number of iterations versus for the Trigonometric Test Function. 

against using low accuracy gradients if high accuracy evaluations are obtainable without 
greatly increased computational expense, low accuracy evaluations are still attractive in 
cases where the computational expense increases rapidly with increasing accuracy. Suppose, 
for instance, that 

CPU/iteration « c 2 £~ f (26) 

for some constants c 2 and l. The total computational expense for solving a given problem 
will then be proportional to ( ~ l K (( g ). Figures (6) through (10) show the predicted total 
computational cost of the median curves for K(( g ) in figures (1) through (5) for the values 
/ = |, 1, and 2 (with each curve normalized so that the minimum value is 1.0). 

Notice that each curve of total computational cost increases very rapidly as £ g — » 0 or 
lj and has a relatively large, flat minima. This behavior holds for both the “typical” 
cases (figures (6) through (8)) and the anomalous cases (figures (9) and (10)). Interestingly, 
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f igure 8: Computational cost profiles for the Extended Powell Singular Test Function. 



Figure 9: Computational cost profiles for the Gaussian Test Function. 
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Figure 10: Computational cost profiles for the Trigonometric Test Function, 
combining the idealizations (25) and (26) into 


CPU » K(0)c 2 (; l exp{b( g ) (27) 

yields the theoretical “best” value 



Using the observed “typical” value 6 = 4 yields the rule of thumb choice 


(28) 



(29) 


which worked quite well for all of our test problems. 

A number of variations on these numerical tests were also tried. Rather than (26) we 
also considered the idealization 
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Figure 11: Actual versus Commanded error in the gradient for the Tritium Parameter Esti- 
mation Problem. 


so that 



(38) 


ehk _ V/(**) T j7 fc 

9 k 9k d k 

Figure (12) shows the CPU time required to achieve a given level of accuracy using (38) in 

addition to the previously discussed procedure for adjusting We see that computational 

expense increases geometrically as accuracy increases. 2 

Given these methods of evaluating /*. and to some specified accuracy, we recorded the 

2 The idealized cost profile (26) yields a very close fit to this plot if l is taken to be However, tests 
with different values of x * showed that better empirical form this problem is 

CPU/gradient evaluation & C 2 ||( 7 fc||“ 1 Cj 1 ^ 10 . 

Nonetheless, the rule-of-thumb choice (29) with l = proved to be close to the optimal selection in our 
numerical tests. 



32 




— 7.00 — 4.90 - 2.80 —.70 

I n (Relative Gradient Error) 

Figure 12: Solution time required for one gradient evaluation in Tritium parameter identifi- 
cation problem. 

computational time required to solve (33 ) for a number of different values of ( g , and for the 
following 3 cases. 

1 . Each fk value was computed to high relative accuracy (10 -8 ), and each value was 
computed as previously described including the correction (38). 

2. Each fk was computed to high relative accuracy (10 -8 ), and each < 7 * was computed as 
described previously but without doing correction (38). 

3. Each fk was computed using Procedure 2 with (f t i = 0.1 and 2 = 0.99, and each gk 
value was computed as previously described including correction (38). 

A somewhat different implementation of the trust region method was used rather than the 
Dennis-Schnabel code used in the last section. First, we included nonnegativity constraints 
on the first three components of x to be consistent with the physics of the problem. This 
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Figure 13: Solution time for numerical optimization of the Tritium parameter identification 
problem. 

was done by replacing the ellipsoidal trust region in (4) with the rectangular trust region 
||-D/k-s||oo < A*, and by using a quadratic programming code to exactly solve the trust 
region subproblem subject to the nonnegativity constraints on x. Second, we used the 
Hessian safeguarding techniques proposed in [2] in addition to the standard BFGS update 
(5). Third, we used a simplified stopping criteria for comparison purposes. Each test case 
was terminated when /* — /*< Aj(/o — /*), where /* was the optimal value of /. 

Figure 13 shows the results of our tests. 

Case 1 was tested for 15 different values of ( g . Note that the total computational time 
required is less than 8000 seconds for ( g = 0.15, but rises to almost 16000 and 24000 seconds 
for ( g = 1.5 x 10 -5 and ( g = 0.25, respectively. Fewer data were collected for cases 2 and 
3, but note that the correction (38) appears to make little difference to the algorithm when 

ehk/glgk is small. On the other hand, using Procedure 2 rather than computing each /* 
to a fixed accuracy of 10 -8 resulted in a moderately faster algorithm. 
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